Application of two-component neural network for exchange-correlation functional interpolation

Density functional theory (DFT) is one of the primary approaches to solving the many-body Schrodinger equation. The essential part of the DFT theory is the exchange-correlation (XC) functional, which can not be obtained in analytical form. Accordingly, the accuracy improvement of the DFT is mainly based on the development of XC functional approximations. Commonly, they are built upon analytic solutions in low- and high-density limits and result from quantum Monte Carlo or post-Hartree-Fock numerical calculations. However, there is no universal functional form to incorporate these data into XC functional. Instead, various parameterizations use heuristic rules to build a specific XC functional. The neural network (NN) approach to interpolate the data from higher precision theories can give a unified path to parametrize an XC functional. Moreover, data from many existing quantum chemical databases could provide the XC functional with improved accuracy. We develop NN XC functional, which gives exchange potential and energy density without direct derivatives of exchange-correlation energy density. Proposed NN architecture consists of two parts NN-E and NN-V, which could be trained in separate ways, adding new flexibility to XC functional. We also show that the developed NN XC functional converges in the self-consistent cycle and gives reasonable energies when applied to atoms, molecules, and crystals.

www.nature.com/scientificreports/ functionals form [21][22][23][24][25][26] . Work by Nagai et al. 22 is especially worthy of note. They presented the first working NN XC functional, which gives better accuracy than traditional ones. Despite significant advances in the development of XC functionals based on NNs, a wide range of issues remain. Namely, how to relate the exchange-correlation energy density ( ε xc ) and the exchange-correlation potential (XC potential, V xc )? In the analytical description, such connection is provided by a standard differentiation, which is sometimes tedious but straightforward. But when new features are included in the NN, for example, the logarithm of density (standard feature scaling procedure) or local Hartree-Fock exchange energy densities 27 , it is not clear how to incorporate them into the connection between XC potential and corresponding energy density. It is also essential to find a way to include physical insight into XC functionals, such as asymptotic analytical solutions and conservation laws. Another issue is an optimal NN architecture and feature selection for presenting XC functional. So far, no detailed comparison has been made of various NN architectures on the same training data.
In this study we focus on constructing NN XC functional that output both ε xc and V xc . The developed NN consists of two parts: one part is used to evaluate ε xc (NN-E), and another approximates V xc (NN-V). They connected in such a way that the output of NN-E is one of the input features for NN-V (see Fig. 1). Proposed architecture is also "economical", since one back-propagation step does not need to traverse the entire mesh, and it is calculated only at one point. In other practical approaches 22,27 , for one step of the neural network, it is necessary to calculate the total exchange-correlation energy of the system, which significantly increases memory consumption and therefore leads to limitations for training and evaluation of neural networks with a large number of parameters. Direct differentiation of features to obtain the exchange-correlation functional from the density of the exchange-correlation energy can be quite a complex and confusing problem. In particular, if the features are semi-local or global, obtaining the final analytic expression is non-trivial. An example is the exchange energy density as a feature in the work 27 . Local coordinate transformations only complicate the task.
The training dataset was obtained from the DFT calculation of crystalline silicon, benzene, and ammonia with PBE XC functional 10,28 . After XC NN training, we implemented it into Octopus DFT code [29][30][31] and conducted self-consistent cycle calculations of train/test systems and atoms and molecules from IP13/03 dataset 32 . The mean relative error of total and XC energies on training samples is on the order of 0.001% . The same errors on crystals and molecules that were not used to train XC NN increased by the order magnitude but were still small, around 0.01% . Such a small relative error indicates that the proposed architecture could be successfully used for XC functional form representation. The key feature of the proposed NN architecture is that the weights of NN-V are pre-trained on ε xc → V xc mapping and fixed during learning on new training data. It allows being sure that the output of NN-E is indeed the ε xc , because the relation between ε xc and V xc is preserved.
It is also should be noted that the boundary conditions are included by using extra datasets. They are synthetically generated to fulfill given boundary conditions. In present case it was ε xc → 0 given that electron density (n) is vanishing, and ε xc → ε LDA xc given that gradient modulus squared of electron density ( σ ) is also approaching zero.  www.nature.com/scientificreports/ This paper is outlined as follows: In "Methods" section, we describe the data generation process, NN topology, features and hyperparameters used. In the "Results" section, we first show the results of NN training, features distributions, and evaluated metrics; in the following, we demonstrate NN XC's application in DFT calculations on atoms and molecules that were not presented in the training dataset. In "Conclusion" section, we summarize our results and suggest following research steps that could be done with the developed NN XC pipeline, with a particular emphasis on integrating data from high-level methods such as quantum MC and post-HF calculation.

Methods
The proposed neural network architecture consists of two parts: the NN-E and the NN-V. The NN-E serves to obtain the exchange-correlation energy density ( ε xc ), and the NN-V allows calculating of the exchange-correlation potential ( V xc ) from the corresponding energy using the output of NN-E as one of the neurons. Only the spin unpolarized case is considered. Generalization to the spin-polarized case is trivial and can be implemented by increasing the number of input neurons corresponding to the spin components appropriately.
The NN-E input parameters are electron density n and the square of the electron density gradient σ = �∇n, ∇n� . We use base-10 logarithmic transformation for preprocessing input features of NN-E. The input features for the NN-V are ε xc , n, σ , γ = �∇σ , ∇n� and Laplacian of the electron density n . All NN-V features except ε xc are standardized, i.e. converted to zero mean and unit variance (see Fig. 1). The choice of features for NN-V is due to the functional connection between ε xc and V xc in generalized gradient approximation (GGA), which contains the electron density, its gradients in various combinations and the Laplacian. In a real space grid the connection between potential and XC energy density looks as following 33 : where g(r) ≡ |∇n(r)| and � g ≡ ∇n(r) . In a broad sense, the choice of features is arbitrary and the only criterion for its success is the increase in the accuracy of NN.
At the first stage of training, only the NN-V is trained. In this case, the energy that is supplied to the input to the NN-V is obtained using the libXC 34 package. The first stage aims to teach mapping between the ε xc and the corresponding potential, and the ε xc known in advance is used. In this case the following loss function is applied: At the second stage, the NN-V weights are frozen, and only NN-E part is trained, but loss depends on the output of NN-V ( V xc ). The second stage aims to train the neural network to map electron density, σ and V xc . Simultaneously, the frozen NN-V provides the correct connection between ε xc and V xc . At this stage we additionally include boundary condition in the loss function: The first boundary condition is based on the fact that ε xc tends to zero at zero density with any input σ . The second boundary condition follows from the fact that with zero σ and any density ε xc leads to the corresponding energy of the local density approximation.
To obtain the training data, we carry out DFT calculations 1,2 of silicon, benzene, and ammonia. We perform the calculations in real space using the Octopus code [29][30][31] . The XC functionals used include the exchange and correlation parts from PBE 10,28 . For all atomic species pseudo-potentials SG15 are used 35 . The total number of the calculations is 10 for each chemical substance. Atomic configurations in these calculations are differed by applied strain ±5% . Grid spacing in the range from zero to 10 is used. This corresponds to 64 × 64 × 64 mesh for silicon and benzene, and 65 × 65 × 65 mesh for ammonia. The mesh size for ammonia is a little bit larger due to technical issues related to non-periodic boundary conditions. For avoiding gradient inconsistency in the boundaries we use cropping. We remove data points lying at a distance of 4 (due to numerical scheme of differentiation) or less from the boundaries of the parallelepiped. Finally, the dataset size was approximately 5.2 million samples.
The Pytorch framework 36 is utilized for training the neural network. We use the Adam algorithm for training with a learning rate descending from 0.001 by 25 percent every 20 epoch, and mean square error (MSE) loss. The batch size selected for training is 5000. The neural architecture used for the NN-E is a fully connected network with two input neurons, one linear output neuron, and four exponential linear units (ELU) 37 hidden layers with 100 neurons. NN-V has the same architecture except that the number of input neurons increased to 5. To implement boundary conditions (3) we also included two additional batches in each training step. The first one consists of zero electron density and non-zero gradients with corresponding zero ε xc . The second one contains non-zero electron density and zero gradients with corresponding ε LDA xc . Thus, on the one hand, we (3)

Results
Training of neural network. The training of NN was performed on benzene, silicon and ammonia. For each substance ten calculation was performed. In each calculation the system was stretched or compressed in range of ±5% of lattice constant to obtain electron densities in a wider range.
Training curves for NN-V and NN-E are presented in Fig. 2. The loss reaches a plateau after 300k batches for NN-V and 200k batches for NN-E correspondingly. Each batch consisted of 32 sample. The final loss achieved values of about 10 −6 Ha 2 × a 6 0 in both cases. The distributions of input features for training dataset are presented in Fig. 3. Individual distributions of substances included in the training dataset are normalized to the total number of elements corresponding to the given substance. In additional to features used distribution by the Wigner-Seitz radius r s = (3/(4π n)) 1/3 and the reduced density gradient s = |∇n|/2k F n = |∇n|/(2 3 √ 3π 2/3 n 4/3 ) are presented. All distribution is not uniform due to real systems used in training procedure. The distributions localized around the values that are typical for the systems under consideration. Analysis of such distributions is important to determine the limits  Each curve is normalized to to the total number of elements in the corresponding dataset.The distributions localized around the values that are typical for the systems used in training dataset: ammonia, benzene, crystalline silicon. One can see that crystalline silicon has relatively small electron density gradients and no region of low electron density due to its periodic structure. If a neural network is given an input value significantly out of bounds of the distribution data, it potentially could output an inappropriate value. where x n is the true value of target variable, x n is the predicted value of target variable, N is the total number of targets. The resulting MSE and MAE on training dataset are presented in the Table 1. Minimum MAE(ε xc ) and MSE(ε xc ) are obtained on benzene, maximums are achieved on silicon. MSE error on boundary conditions is on the order of 10 −10 , which is considerably less than the total MSE error on the train dataset.
You can see in Fig. 3 that the distributions of silicon are significantly different from the overall distributions of the training dataset. One can see that crystalline silicon has relatively small electron density gradients and no region of low electron density due to its periodic structure. Therefore we attribute the maximum error of ε xc in silicon to the fact that its characteristic values of electron density and its derivatives effectively have a smaller fraction in the training dataset compared to benzene and ammonia.
We also separately analyzed the spatial distribution of the error using benzene as an example. Figure 4 shows the distribution of the relative local error |(x − x)/x| × 100% in the benzene plane for both ε xc and V xc . The spatial distribution of the error clearly shows that it has the highest values in the region of the nuclei and at the boundary. Again, we associate the high local error in these places with a few examples of this type of data in its training set. If there were more extreme examples with high and low density in the training dataset, then the errors in the nuclei region and at the border should have dropped.
Testing of neural network. We incorporate the developed NN XC functional into Octopus code and perform self-consistent cycle calculations. All calculations are converged, and relative total and XC energies errors are calculated. As a reference, we take PBE functional that was used for the training of NN. Results are presented in the Table 2. One can see that the highest error is achieved for atomic and molecular oxygen. x n − x n 2 ,  www.nature.com/scientificreports/ The reasons for the error on oxygen were analyzed. It turned out that the main reason for such a discrepancy on some substances, including oxygen, is the intermediate densities that occurred in the process of a self-consistent cycle that the neural network did not see during the training process. The problem is that such densities can be arbitrarily large. The distributions of maximum electron densities in some cases are presented in Fig. 5 together with the training distribution. Analytical exchange-correlation functional contains asymptotics at infinity, which makes it possible to process such cases correctly. However, only boundary conditions at zero were used during our training cycle, but asymptotics at infinity were not used. This critical fact describes obtained discrepancy and gives hope for the further development and improvement of neural network exchange-correlation functionals by including such asymptotics 38,39 .
To show the performance of developed NN XC on a chemically meaningful quantity we included results for Alkisomer11 11 dataset (see Table 3), for which we made PBE vs NN XC comparison of isomerization energies. Our model demonstrates results comparable to PBE, mean absolute error (MAE) for PBE is 1.6 kcal/mol and MAE for NN XC is 1.7 kcal/mol, which is a quite promising result. As it turned out as a result of the analysis, large errors on octane isomers even for PBE are explained by the fact that the corresponding geometries published in the article 11 are not completely relaxed.
To verify the importance of nonequilibrium densities for convergence of the SCF cycle we expanded the training dataset to include higher densities typical for intermediate densities of oxygen atom (O), which are obtained in self-consistent cycles (see Fig. 5). We stopped SCF iteration at 2,4,6,8,10,12,14,16,18, and 20 iterations and added 10 corresponding SCF results to the training set. We used spacing 0.273 a 0 and 64 × 64 × 64 mesh. The resulting neural network performed quite better for O 2 , the relative error of total energy drops from 0.646 to 0.023% and the relative error of XC energy drops from 2.5 to 0.17%. Thus, we also have shown that the inclusion   Fig. 6). There is a high similarity between the enhancement factors at densities that closely match the most frequent densities in the training set ( r s = 2). We observe slightly less similarity at r s = 10 . This can be explained by the decreased number of elements with such a density in the training set. A significant discrepancy is observed in the limiting cases of high and low densities. In the case of high density (small r s ), this is explained by the absence of asymptotic at infinity. Despite  Figure 6. Enhancement factor F xc at fixed values of r s : 0, 2, 10, 1000. s is a reduced density gradient. There is a good agreement between the enhancement factors at r s = 2 densities, which is close to the most frequent densities in the training set. Also one can see that NN XC functional is bad in guessing asymptotic in low ( r s = 1000) and high ( r s = 0) limits of electron density. www.nature.com/scientificreports/ the presence of asymptotic at zero densities in the process of training a neural network, the obtained result shows that this asymptotic may not be enough for obtaining a good enhancement factor. Even though our model correctly reproduces the energies, the true exchange-correlation functional must satisfy many important physical constraints. Therefore, it is very crucial to discuss how such constraints can be added to the model and learning process in our future work. Firstly, it is important to mention that constraints use total exchange or correlation energies, so correct implementation of constraints can be effectively introduced only while training the model on total energies (much better, reaction energies). For example, an important property of such a functional is the density scaling property. In the article 40 , it was shown that this property can be effectively taken into account using contrastive representation learning. Concerning our model, this means that the densities and all values obtained from it should be fed into NN-E and NN-V not in a "pure" form, but transformed using an encoder trained in an unsupervised way. As stated in the article mentioned above the NTxent loss is a good choice. This loss causes the projected representations of unscaled and scaled densities of the same molecule to get close and the representations of different molecules to be far from each other. With another module predicting scaling factors this approach can help to incorporate uniform density scaling constraint. In the case of our work, we did not actively investigate the imposition of any constraints, this is the topic of ongoing research, but we checked that the presented functional does not lose accuracy when the SH molecule was rotated (see Table 4). As a result of additional calculations, it turned out that our model demonstrates excellent rotational equivariance with a total energy difference significantly smaller than 1 kcal/mol. XC energy density and the XC potential are related with a functional derivative (see for example Eq. 1) Therefore, we compared the exchange-correlation potential predicted by the NN-V and the corresponding potential calculated using the functional derivative. Indirectly, the fact that the analytical relationship between ε xc and V xc is preserved can be shown by the fact that the errors both on energy and on potential are quite small at the same time (see Table 1). If the neural network did not correctly reproduce this connection, then, accordingly, the error on the potential would be quite large. We aslo demonstrated these results clearly on the hydrogen molecule and built distributions of ε xc and V xc along the line connecting the hydrogen atoms (see Fig. 7).

Conclusions
The second NN (such as NN-V) that directly predicts the XC potential is avoided in recent publications 26,41 , by imposing automatic differentiation of the NN. Automatic differentiation is a remarkable property of neural networks, which is actively used for developing neural network exchange-correlation potentials. However, it has some disadvantages: if the neural network is deep enough then automatic differentiation makes much noise, especially if we talk about higher-order derivatives (most of the implementations in literature use DFT codes based on atomic orbitals, where only first-order derivatives are needed); for simple features, it is pretty straightforward to write an analytical form of the relationship between the potential and the XC energy. However, if we take complex and non-standard features (semi-local,global ones), it can be practically impossible to connect the potential and the exchange-correlation energy. This is where the advantage of the second neural network becomes important-it provides an alternative path for an approximation that does not involve taking partial derivatives. This allows using more complex features in the machine learning pipeline. Low error on the validation dataset indicates that the developed approach to the architecture of the XC functional interpolates well the existing XC functionals and has the high generalizing ability. Furthermore, in the framework of the proposed architecture, one can train NN-E and NN-V separately, making it flexible to use. The basic strategy to create working functional in the framework of the proposed architecture would be initially to train the NN-V part on a specific type of existing XC functional such as LDA, GGA, or meta-GGA, using various types of input features. Then the NN-E part of the neural network is trained on the data obtained by accurate post-Hartree-Fock or quantum Monte Carlo methods.
In this work, we did not take into account that the resulting functional may violate the rules of critical physical conditions 41,42 . However, all these conditions can be introduced by modifying the training loop or loss function and, therefore, can be explicitly taken into account. This is a matter for further research.
The main advantage of the NN approach in comparison with other interpolation techniques for XC functionals is its flexibility to incorporate exchange-correlation data from different sources, such as post-Hartree-Fock and quantum Monte Carlo. It is possible that application of the NN to interpolate high-level XC quantum data could eliminate many heuristics used in the traditional construction of XC functionals.

Data availability
Optimized NN weights and code to use it in calculations are available at https:// github. com/ Alexa nderF reeman/ octop us-nn-2.